Data science analysis to provide Hilton Hotel International with recommendation on strategies that would lead to revenue growth.
The solutions to the following questions will help us answer our research question:
The data is originally from the article Hotel Booking Demand Datasets, by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.
How many bookings were canceled? ● What was the booking ratio between resort hotels and city hotels? ● What was the percentage of booking for each year? ● Which was the busiest month for hotels? ● Most guests came from which country? ● How long do most people stay in hotels? ● Which was the most booked accommodation type (Single, Couple, Family)?
How data was relevant to our question.
#install.packages("tidyverse")
#install.packages("readr")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
# install.packages("reshape")
library(reshape)
##
## Attaching package: 'reshape'
##
## The following object is masked from 'package:dplyr':
##
## rename
##
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
# Load the data below
# ---
# Dataset url = https://bit.ly/2WvQbhJ
# ---
# YOUR CODE GOES BELOW
#
hotel_booking_df <- read.csv("hotel_bookings.csv", header = TRUE, na.strings=c("NULL",NA))
hotel_booking_df
# Checking the first 5 rows of data
# ---
# YOUR CODE GOES BELOW
#
head(hotel_booking_df, 5)
# Checking the last 5 rows of data
# ---
# YOUR CODE GOES BELOW
#
tail(hotel_booking_df, 5)
# Sample 10 rows of data
# ---
# YOUR CODE GOES BELOW
#
sample_n(hotel_booking_df, 10)
# Checking number of rows and columns
# ---
# YOUR CODE GOES BELOW
#
dim(hotel_booking_df)
## [1] 119390 32
# Checking datatypes
# ---
# YOUR CODE GOES BELOW
#
str(hotel_booking_df)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : int NA NA NA 304 240 240 NA 303 240 15 ...
## $ company : int NA NA NA NA NA NA NA NA NA NA ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
Record your general observations below:
Observation 1:It was observed that there were columns with null values i.e agent and company. Observation 2
The data is originally from the article Hotel Booking Demand Datasets, by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.
# Checking datatypes and missing entries of all the variables
# ---
# YOUR CODE GOES BELOW
#
str(hotel_booking_df)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : int NA NA NA 304 240 240 NA 303 240 15 ...
## $ company : int NA NA NA NA NA NA NA NA NA NA ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
sum(is.na(hotel_booking_df))
## [1] 129425
We observe the following from our dataset:
Observation 1 Observation 2: There are 129425 missing values in our data set
# Checking how many duplicate rows are there in the data
# ---
# YOUR CODE GOES BELOW
#
#
sum(duplicated(hotel_booking_df))
## [1] 31994
I got 31994 duplicated entries .
# Checking if any of the columns are all null
# ---
# YOUR CODE GOES BELOW
#
which(colSums(is.na(hotel_booking_df)) == ncol(hotel_booking_df))
## named integer(0)
Observation 1: I could not find any columns in which all the data is null
# Checking if any of the rows are all null
# ---
# YOUR CODE GOES BELOW
which(rowSums(is.na(hotel_booking_df)) == nrow(hotel_booking_df))
## integer(0)
Observation 1: I could not find any row in which all the data is null Observation 2
# Checking the correlation of the features through the use of
# visualizations the correlation using heatmap
# ---
# YOUR CODE GOES BELOW
#
sample(hotel_booking_df,10)
# Select the desired columns
hotel_booking_df2 <- hotel_booking_df %>%
na.omit() %>%
select(is_canceled, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, is_repeated_guest, previous_cancellations, previous_bookings_not_canceled, days_in_waiting_list, total_of_special_requests)
hotel_booking_df2
# Create a correlation matrix
corr_matrix <- cor(hotel_booking_df2, method="s")
## Warning in cor(hotel_booking_df2, method = "s"): the standard deviation is zero
head(corr_matrix)
## is_canceled stays_in_weekend_nights
## is_canceled 1.00000000 -0.1438928
## stays_in_weekend_nights -0.14389278 1.0000000
## stays_in_week_nights -0.08137791 0.7794933
## adults -0.04922699 -0.0664910
## children 0.05446082 -0.1683479
## babies NA NA
## stays_in_week_nights adults children babies
## is_canceled -0.08137791 -0.04922699 0.05446082 NA
## stays_in_weekend_nights 0.77949325 -0.06649100 -0.16834790 NA
## stays_in_week_nights 1.00000000 -0.15183455 -0.12611588 NA
## adults -0.15183455 1.00000000 0.20245466 NA
## children -0.12611588 0.20245466 1.00000000 NA
## babies NA NA NA 1
## is_repeated_guest previous_cancellations
## is_canceled -0.01183921 -0.01983730
## stays_in_weekend_nights -0.13333459 -0.02758733
## stays_in_week_nights -0.26416244 -0.08427287
## adults -0.07395879 -0.05488996
## children 0.17358125 -0.01147280
## babies NA NA
## previous_bookings_not_canceled days_in_waiting_list
## is_canceled -0.005786814 NA
## stays_in_weekend_nights -0.129795621 NA
## stays_in_week_nights -0.233541874 NA
## adults -0.133680418 NA
## children 0.190626508 NA
## babies NA NA
## total_of_special_requests
## is_canceled -0.11646585
## stays_in_weekend_nights -0.12033640
## stays_in_week_nights -0.24581095
## adults 0.01722921
## children 0.09985491
## babies NA
# Create a table with long form
corr_df <- melt(corr_matrix)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
corr_df
# Plot the heatmap
ggplot(corr_df, aes(X1, X2, fill = value)) +
geom_tile(color = "black") +
geom_text(
aes(label = round(value, 2)),
color = "white"
) +
coord_fixed() +
labs(
fill="Pearson Correlation"
) +
scale_fill_gradient2(
low = "blue",
high = "red",
mid = "white",
limit = c(-1,1)
) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
## Warning: Removed 38 rows containing missing values (geom_text).
We observe the following from our dataset:
Observation 1 Observation 2
# Dropping company column because it has alot of missing values
# and we won't need to answer any of our questions
# ---
# YOUR CODE GOES BELOW
#
hotel_booking_df = select(hotel_booking_df, -company)
head(hotel_booking_df)
From the data variable description we see that the Distribution Channel category that tells us about Booking distribution.
The term “TA” means “Travel Agents” The term “TO” means “Tour Operators” This allows us to fill the missing values in the agents column with TO
# We replace the mising values i.e. for TO
# ---
# YOUR GOES BELOW
#
hotel_booking_df$agent[is.na(hotel_booking_df$agent)] <- "TO"
hotel_booking_df
# We drop rows where there is no adult, baby and child as
# these records won't help us.
# ---
# YOUR GOES BELOW
#
hotel_booking_df <- subset(hotel_booking_df, adults!=0 | children!=0 | babies!=0)
hotel_booking_df
# We replace missing children values with rounded mean value
# ---
# Hint i.e. use round()
# ---
# YOUR GOES BELOW
#
hotel_booking_df$children[is.na(hotel_booking_df$children)] <- round(mean(hotel_booking_df$children, na.rm = TRUE), 0)
hotel_booking_df